Hemlata Channe

Project 2: Supervised Learning - Classification

Data Description:

Parkinson's Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests as a deterioration of movement, including tremors and stiffness. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Cognitive impairments and changes in mood can also occur, and the risk of dementia is increased.

Traditional diagnosis of Parkinson's Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test for PD, diagnosis is often difficult, particularly in the early stages when motor effects are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits by the patient. An effective screening process, particularly one that does not require a clinic visit, would therefore be beneficial. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive diagnostic tool. If machine learning algorithms could be applied to a voice recording dataset to accurately diagnose PD, this would be an effective screening step prior to an appointment with a clinician.

Domain: Medical

Attribute Information:

name - ASCII subject name and recording number

MDVP:Fo(Hz) - Average vocal fundamental frequency

MDVP:Fhi(Hz) - Maximum vocal fundamental frequency

MDVP:Flo(Hz) - Minimum vocal fundamental frequency

MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency

MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude

NHR,HNR - Two measures of ratio of noise to tonal components in the voice

status - Health status of the subject: one = Parkinson's, zero = healthy

RPDE,D2 - Two nonlinear dynamical complexity measures

DFA - Signal fractal scaling exponent

spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation

Learning Outcomes:

Exploratory Data Analysis, Supervised Learning, Ensemble Learning

Objective:

The goal is to classify the patients into the respective labels using the attributes derived from their voice recordings.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt       # matplotlib.pyplot plots data
%matplotlib inline 

import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import zscore
from sklearn import metrics

Task 1: Load the dataset

In [4]:
Data = pd.read_csv("Data-Parkinsons")
Data.head()
Out[4]:
name MDVP:Fo(Hz) MDVP:Fhi(Hz) MDVP:Flo(Hz) MDVP:Jitter(%) MDVP:Jitter(Abs) MDVP:RAP MDVP:PPQ Jitter:DDP MDVP:Shimmer ... Shimmer:DDA NHR HNR status RPDE DFA spread1 spread2 D2 PPE
0 phon_R01_S01_1 119.992 157.302 74.997 0.00784 0.00007 0.00370 0.00554 0.01109 0.04374 ... 0.06545 0.02211 21.033 1 0.414783 0.815285 -4.813031 0.266482 2.301442 0.284654
1 phon_R01_S01_2 122.400 148.650 113.819 0.00968 0.00008 0.00465 0.00696 0.01394 0.06134 ... 0.09403 0.01929 19.085 1 0.458359 0.819521 -4.075192 0.335590 2.486855 0.368674
2 phon_R01_S01_3 116.682 131.111 111.555 0.01050 0.00009 0.00544 0.00781 0.01633 0.05233 ... 0.08270 0.01309 20.651 1 0.429895 0.825288 -4.443179 0.311173 2.342259 0.332634
3 phon_R01_S01_4 116.676 137.871 111.366 0.00997 0.00009 0.00502 0.00698 0.01505 0.05492 ... 0.08771 0.01353 20.644 1 0.434969 0.819235 -4.117501 0.334147 2.405554 0.368975
4 phon_R01_S01_5 116.014 141.781 110.655 0.01284 0.00011 0.00655 0.00908 0.01966 0.06425 ... 0.10470 0.01767 19.649 1 0.417356 0.823484 -3.747787 0.234513 2.332180 0.410335

5 rows × 24 columns

In [8]:
Data.shape
Out[8]:
(195, 24)

Task 2: It is always a good practice to eye-ball the raw data to get a feel for it in terms of the number of records, structure of the file, number of attributes, types of attributes, and a general idea of likely challenges in the dataset.

Mention a few comments in this regard (5 points)

In [5]:
Data.columns
Out[5]:
Index(['name', 'MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)',
       'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP',
       'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
       'MDVP:APQ', 'Shimmer:DDA', 'NHR', 'HNR', 'status', 'RPDE', 'DFA',
       'spread1', 'spread2', 'D2', 'PPE'],
      dtype='object')
In [6]:
Data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 17  status            195 non-null    int64  
 18  RPDE              195 non-null    float64
 19  DFA               195 non-null    float64
 20  spread1           195 non-null    float64
 21  spread2           195 non-null    float64
 22  D2                195 non-null    float64
 23  PPE               195 non-null    float64
dtypes: float64(22), int64(1), object(1)
memory usage: 36.7+ KB
In [18]:
Data.describe()
Out[18]:
       MDVP:Fo(Hz)  MDVP:Fhi(Hz)  MDVP:Flo(Hz)  MDVP:Jitter(%)  \
count   195.000000    195.000000    195.000000      195.000000   
mean    154.228641    197.104918    116.324631        0.006220   
std      41.390065     91.491548     43.521413        0.004848   
min      88.333000    102.145000     65.476000        0.001680   
25%     117.572000    134.862500     84.291000        0.003460   
50%     148.790000    175.829000    104.315000        0.004940   
75%     182.769000    224.205500    140.018500        0.007365   
max     260.105000    592.030000    239.170000        0.033160   

       MDVP:Jitter(Abs)    MDVP:RAP    MDVP:PPQ  Jitter:DDP  MDVP:Shimmer  \
count        195.000000  195.000000  195.000000  195.000000    195.000000   
mean           0.000044    0.003306    0.003446    0.009920      0.029709   
std            0.000035    0.002968    0.002759    0.008903      0.018857   
min            0.000007    0.000680    0.000920    0.002040      0.009540   
25%            0.000020    0.001660    0.001860    0.004985      0.016505   
50%            0.000030    0.002500    0.002690    0.007490      0.022970   
75%            0.000060    0.003835    0.003955    0.011505      0.037885   
max            0.000260    0.021440    0.019580    0.064330      0.119080   

       MDVP:Shimmer(dB)  ...  Shimmer:DDA         NHR         HNR      status  \
count        195.000000  ...   195.000000  195.000000  195.000000  195.000000   
mean           0.282251  ...     0.046993    0.024847   21.885974    0.753846   
std            0.194877  ...     0.030459    0.040418    4.425764    0.431878   
min            0.085000  ...     0.013640    0.000650    8.441000    0.000000   
25%            0.148500  ...     0.024735    0.005925   19.198000    1.000000   
50%            0.221000  ...     0.038360    0.011660   22.085000    1.000000   
75%            0.350000  ...     0.060795    0.025640   25.075500    1.000000   
max            1.302000  ...     0.169420    0.314820   33.047000    1.000000   

             RPDE         DFA     spread1     spread2          D2         PPE  
count  195.000000  195.000000  195.000000  195.000000  195.000000  195.000000  
mean     0.498536    0.718099   -5.684397    0.226510    2.381826    0.206552  
std      0.103942    0.055336    1.090208    0.083406    0.382799    0.090119  
min      0.256570    0.574282   -7.964984    0.006274    1.423287    0.044539  
25%      0.421306    0.674758   -6.450096    0.174351    2.099125    0.137451  
50%      0.495954    0.722254   -5.720868    0.218885    2.361532    0.194052  
75%      0.587562    0.761881   -5.046192    0.279234    2.636456    0.252980  
max      0.685151    0.825288   -2.434031    0.450493    3.671155    0.527367  

[8 rows x 23 columns]

Findings from column description

1. There are 24 columns. All the columns except name are numeric and the dataset has 195 rows.

2. Dataset does not have any null values.

3. Status column is the target column.

4. Some of the columns show outliers, judging from the gap between the mean and the maximum value; e.g. the column 'MDVP:Fhi(Hz)' has outliers.
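The mean-versus-max reading above can be made precise with the standard 1.5×IQR rule. A minimal sketch, using a small synthetic stand-in for `Data` (the extreme values injected into 'MDVP:Fhi(Hz)' are purely illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's Data frame (illustrative values only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "MDVP:Fo(Hz)": rng.normal(154, 41, 195),
    "MDVP:Fhi(Hz)": np.append(rng.normal(180, 30, 190), [500, 550, 592, 480, 520]),
})

def iqr_outlier_counts(frame):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per column."""
    q1, q3 = frame.quantile(0.25), frame.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return ((frame < lo) | (frame > hi)).sum()

counts = iqr_outlier_counts(df)
print(counts)
```

Running the same helper on the real `Data` frame would tabulate, per column, how many points the boxplots flag.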

Task 3: Use univariate & bivariate analysis to check the individual attributes for their basic statistics such as central values, spread, tails, and relationships between variables, and mention your observations (15 points)

Plotted boxplots of all columns to see the univariate distributions.

Boxplots show outliers in the 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5', 'MDVP:APQ', 'HNR', and 'spread1' attributes.

These columns were also examined individually with seaborn boxplots.

Plotted histograms of all attributes.

Observations from the histograms:

1. The columns 'HNR', 'Jitter:DDP', 'MDVP:PPQ', 'Shimmer:APQ3', 'Shimmer:APQ5', 'spread1', 'spread2' appear approximately normally distributed.

2. Attributes showing skewness to the right (right-tailed):

'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Fo(Hz)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP', 'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5', 'MDVP:APQ', 'Shimmer:DDA', 'NHR'

Plotted a pairplot of all the columns. It shows no clear linear relationship between the predictor columns and the target column.
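Since the pairplot shows no obvious linear relationships, a rank (Spearman) correlation against `status` is one way to quantify monotone association anyway. A sketch on synthetic stand-in columns (the effect sizes below are invented for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 'PPE' shifted up and 'HNR' shifted down for PD cases.
rng = np.random.default_rng(1)
status = rng.integers(0, 2, 195)
df = pd.DataFrame({
    "PPE": status * 0.15 + rng.normal(0.2, 0.05, 195),
    "HNR": -status * 3 + rng.normal(22, 4, 195),
    "status": status,
})

# Spearman works on ranks, so it also captures monotone non-linear trends.
corr = df.corr(method="spearman")["status"].drop("status").sort_values()
print(corr)
```

On the real frame, `Data.drop(columns="name").corr(method="spearman")["status"]` would rank all 22 features by association with the target.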

In [22]:
plt.figure(figsize=(12,10))
Data.iloc[:,0:9].boxplot()
plt.show()
In [28]:
plt.figure(figsize=(12,10))
Data.iloc[:,9:14].boxplot()
plt.show()
In [29]:
plt.figure(figsize=(12,10))
Data.iloc[:,14:20].boxplot()
plt.show()
In [30]:
plt.figure(figsize=(12,10))
Data.iloc[:,20:25].boxplot()
plt.show()
In [35]:
# 9 columns to examine for outliers:
# 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Shimmer', 'MDVP:Shimmer(dB)',
# 'Shimmer:APQ3', 'Shimmer:APQ5', 'MDVP:APQ', 'HNR', 'spread1'
fig, ax =plt.subplots(1,9, figsize = (18,5))
sns.boxplot(Data['MDVP:Fhi(Hz)'], ax=ax[0])
sns.boxplot(Data['MDVP:Flo(Hz)'], ax =ax[1])
sns.boxplot(Data['MDVP:Shimmer'], ax = ax[2])
sns.boxplot(Data['MDVP:Shimmer(dB)'], ax = ax[3])
sns.boxplot(Data['Shimmer:APQ3'], ax = ax[4])
sns.boxplot(Data['Shimmer:APQ5'], ax = ax[5])
sns.boxplot(Data['MDVP:APQ'], ax = ax[6])
sns.boxplot(Data['HNR'], ax = ax[7])
sns.boxplot(Data['spread1'], ax = ax[8])
plt.show()


In [36]:
Data.hist(stacked=False, bins=100, figsize=(12,30), layout=(14,2)); 
In [43]:
cols = ['MDVP:Fo(Hz)', 'MDVP:Jitter(%)','MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'HNR','RPDE', 'DFA',
       'spread1', 'spread2', 'D2', 'PPE', 'status']
sns.pairplot(Data[cols])
Out[43]:
<seaborn.axisgrid.PairGrid at 0x7f6037fc25d0>
In [38]:
sns.pairplot(Data)
Out[38]:
<seaborn.axisgrid.PairGrid at 0x7f604b939e50>

Task 4: Split the data into training and test sets in the ratio of 70:30 respectively (5 points)

In [53]:
x = Data.drop(['status','name'],axis=1) 
y = Data['status']
In [54]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
Data.shape,x_train.shape,x_test.shape
Out[54]:
((195, 24), (136, 22), (59, 22))
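Since about 75% of subjects are PD cases, a stratified split keeps that class ratio identical in both partitions; plain random splitting can drift a few points either way. A sketch with `stratify`, using a synthetic label vector matching the dataset's rough class balance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 147 PD (1) and 48 healthy (0), roughly the real balance.
X = np.random.default_rng(2).normal(size=(195, 22))
y = np.array([1] * 147 + [0] * 48)

# stratify=y forces both folds to keep (approximately) the same class ratio.
x_tr, x_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print(y_tr.mean(), y_te.mean())
```

The same `stratify=Data['status']` argument could be added to the split above.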

Task 5: Prepare the data for training - scale the data if necessary, get rid of missing values (if any), etc. (5 points)

1. There are no missing values in this dataset.

2. Data scaling

In [55]:
XScaled  = x.apply(zscore)  # convert all attributes to Z scale 

XScaled.describe()
Out[55]:
MDVP:Fo(Hz) MDVP:Fhi(Hz) MDVP:Flo(Hz) MDVP:Jitter(%) MDVP:Jitter(Abs) MDVP:RAP MDVP:PPQ Jitter:DDP MDVP:Shimmer MDVP:Shimmer(dB) ... MDVP:APQ Shimmer:DDA NHR HNR RPDE DFA spread1 spread2 D2 PPE
count 1.950000e+02 1.950000e+02 1.950000e+02 1.950000e+02 1.950000e+02 1.950000e+02 1.950000e+02 1.950000e+02 1.950000e+02 1.950000e+02 ... 1.950000e+02 1.950000e+02 1.950000e+02 1.950000e+02 1.950000e+02 1.950000e+02 1.950000e+02 1.950000e+02 1.950000e+02 1.950000e+02
mean 3.529940e-17 -2.237526e-16 1.309494e-16 -2.127927e-17 2.562053e-18 -1.380662e-16 9.351494e-17 1.015569e-16 2.818258e-16 -1.374969e-16 ... -8.824850e-17 -1.577086e-16 5.152574e-17 8.770762e-16 -1.913000e-16 5.687758e-16 1.184451e-15 -1.429056e-16 -6.117614e-16 -2.960595e-17
std 1.002574e+00 1.002574e+00 1.002574e+00 1.002574e+00 1.002574e+00 1.002574e+00 1.002574e+00 1.002574e+00 1.002574e+00 1.002574e+00 ... 1.002574e+00 1.002574e+00 1.002574e+00 1.002574e+00 1.002574e+00 1.002574e+00 1.002574e+00 1.002574e+00 1.002574e+00 1.002574e+00
min -1.596162e+00 -1.040581e+00 -1.171366e+00 -9.389487e-01 -1.064103e+00 -8.872543e-01 -9.180440e-01 -8.873331e-01 -1.072340e+00 -1.014787e+00 ... -9.993055e-01 -1.097815e+00 -6.002051e-01 -3.045707e+00 -2.333888e+00 -2.605676e+00 -2.097268e+00 -2.647338e+00 -2.510472e+00 -1.802384e+00
25% -8.879183e-01 -6.820590e-01 -7.379376e-01 -5.708520e-01 -6.898141e-01 -5.561906e-01 -5.764609e-01 -5.557071e-01 -7.020291e-01 -6.881025e-01 ... -6.508513e-01 -7.326182e-01 -4.693595e-01 -6.089102e-01 -7.449206e-01 -7.852617e-01 -7.041503e-01 -6.269844e-01 -7.404100e-01 -7.687420e-01
50% -1.317379e-01 -2.331437e-01 -2.766579e-01 -2.647942e-01 -4.018994e-01 -2.724216e-01 -2.748504e-01 -2.736279e-01 -3.583019e-01 -3.151160e-01 ... -3.444009e-01 -2.841460e-01 -3.271036e-01 4.508553e-02 -2.490033e-02 7.527941e-02 -3.353960e-02 -9.166005e-02 -5.315145e-02 -1.390580e-01
75% 6.913210e-01 2.969710e-01 5.458200e-01 2.366858e-01 4.618447e-01 1.785683e-01 1.848331e-01 1.784870e-01 4.346898e-01 3.485429e-01 ... 3.146448e-01 4.543110e-01 1.966835e-02 7.225273e-01 8.587132e-01 7.932500e-01 5.869042e-01 6.337615e-01 6.668912e-01 5.165137e-01
max 2.564598e+00 4.327631e+00 2.829908e+00 5.570985e+00 6.220139e+00 6.125892e+00 5.862742e+00 6.126923e+00 4.751617e+00 5.246243e+00 ... 6.726438e+00 4.029746e+00 7.192738e+00 2.528321e+00 1.800007e+00 1.942048e+00 2.989093e+00 2.692370e+00 3.376831e+00 3.569059e+00

8 rows × 22 columns

In [56]:
x_train, x_test, y_train, y_test = train_test_split(XScaled, y, test_size=0.3, random_state=42)
Data.shape,x_train.shape,x_test.shape
Out[56]:
((195, 24), (136, 22), (59, 22))
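Note that the zscore above is computed on the full dataset before splitting, which lets test-set statistics leak into training. A leak-free variant fits the scaler on the training fold only and then transforms both folds; a sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled feature matrix.
X = np.random.default_rng(3).normal(loc=5.0, scale=2.0, size=(195, 22))
y = np.random.default_rng(3).integers(0, 2, 195)

x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler().fit(x_tr)            # statistics from train fold only
x_tr_s = scaler.transform(x_tr)                # train: exactly mean 0, std 1
x_te_s = scaler.transform(x_te)                # test: scaled with train statistics
print(x_tr_s.mean().round(3), x_tr_s.std().round(3))
```

With only 195 rows the practical difference here is small, but fitting the scaler on the training fold is the safer habit.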

Task 6: Train at least 3 standard classification algorithms - Logistic Regression, Naive Bayes, SVM, k-NN, etc. - and note down their accuracies on the test data (10 points)

1. Logistic Regression classifier

In [80]:
# Fit the model on train
from sklearn.linear_model import LogisticRegression

clf_lr = LogisticRegression(solver="liblinear")
clf_lr.fit(x_train, y_train)
#predict on test
y_predict = clf_lr.predict(x_test)


coef_df = pd.DataFrame(clf_lr.coef_)
coef_df['intercept'] = clf_lr.intercept_
print(coef_df)
model_score = clf_lr.score(x_test, y_test)
print("Train Accuracy with Logistic Regression = ",clf_lr.score(x_train, y_train) )
print("Test Accuracy with Logistic Regression = ",model_score)
          0         1         2         3         4         5         6  \
0 -0.379463 -0.379835 -0.263709 -0.513128 -0.347027  0.322513 -0.067951   

          7         8         9  ...        13        14        15        16  \
0  0.321907  0.122427  0.157848  ... -0.121339 -0.260451  0.024935 -0.247376   

         17        18        19        20        21  intercept  
0  0.193041  0.588545  0.224297  1.135628  0.802378   1.993026  

[1 rows x 23 columns]
Train Accuracy with Logistic Regression =  0.8676470588235294
Test Accuracy with Logistic Regression =  0.8813559322033898

2. K-NN Model

In [81]:
from sklearn.neighbors import KNeighborsClassifier
clf_NNH = KNeighborsClassifier(n_neighbors= 7 , weights = 'distance' )
clf_NNH.fit(x_train,y_train)
predicted_labels = clf_NNH.predict(x_test)
print("Train Accuracy with KNN = ",clf_NNH.score(x_train, y_train) )
print("Test Accuracy with KNN = ",clf_NNH.score(x_test, y_test))
Train Accuracy with KNN =  1.0
Test Accuracy with KNN =  0.8983050847457628
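The n_neighbors=7 above is a fixed choice; a small cross-validated scan over odd k values is a common way to pick it. A sketch on a synthetic dataset of the same shape (the real run would pass the scaled features instead):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with roughly the dataset's size and class imbalance.
X, y = make_classification(n_samples=195, n_features=22, weights=[0.25],
                           random_state=0)

# Mean 5-fold CV accuracy for each odd k from 1 to 15.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                             cv=5, scoring="accuracy").mean()
          for k in range(1, 16, 2)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Odd k avoids ties in binary voting; the 1.0 train accuracy printed above is expected with weights='distance', since each training point is its own nearest neighbor.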

3. Naive Bayes Model

In [82]:
from sklearn.naive_bayes import GaussianNB
clf_NB = GaussianNB()
clf_NB.fit(x_train, y_train)
#predict on test
y_predict_NB = clf_NB.predict(x_test)
model_score_NB = clf_NB.score(x_test, y_test)
print("Train Accuracy with Naive Bayes = ",clf_NB.score(x_train, y_train) )
print("Test Accuracy with Naive Bayes = ",model_score_NB)
Train Accuracy with Naive Bayes =  0.6838235294117647
Test Accuracy with Naive Bayes =  0.7627118644067796

4. SVM classifier

In [83]:
from sklearn import svm
clf_svm = svm.SVC(gamma=0.025, C=3)  
clf_svm.fit(x_train , y_train)
y_pred = clf_svm.predict(x_test)
print("Train Accuracy %0.2f " % (clf_svm.score(x_train, y_train)))
print("Accuracy %0.2f " % (clf_svm.score(x_test, y_test)))
Train Accuracy 0.90 
Accuracy 0.92 
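The gamma=0.025, C=3 pair above looks hand-picked; a small grid search tunes both on the training fold. A sketch on synthetic stand-in data (the grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in with the dataset's rough shape.
X, y = make_classification(n_samples=195, n_features=22, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# 5-fold CV over a small C/gamma grid, scored by accuracy.
grid = GridSearchCV(SVC(),
                    {"C": [0.5, 1, 3, 10], "gamma": [0.01, 0.025, 0.1]},
                    cv=5, scoring="accuracy").fit(x_tr, y_tr)
print(grid.best_params_, round(grid.score(x_te, y_te), 3))
```

Because the grid is evaluated only on training-fold splits, the test set stays untouched until the final score.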
In [84]:
print("Train Accuracy with Logistic Regression = ",clf_lr.score(x_train, y_train) )
print("Test Accuracy with Logistic Regression = ",model_score)
print("Train Accuracy with KNN = ",clf_NNH.score(x_train, y_train) )
print("Test Accuracy with KNN = ",clf_NNH.score(x_test, y_test))
print("Train Accuracy with Naive Bayes = ",clf_NB.score(x_train, y_train) )
print("Test Accuracy with Naive Bayes = ",model_score_NB)
print("Train Accuracy in SVM %0.2f " % (clf_svm.score(x_train, y_train)))
print("Test Accuracy in SVM %0.2f " % (clf_svm.score(x_test, y_test)))
Train Accuracy with Logistic Regression =  0.8676470588235294
Test Accuracy with Logistic Regression =  0.8813559322033898
Train Accuracy with KNN =  1.0
Test Accuracy with KNN =  0.8983050847457628
Train Accuracy with Naive Bayes =  0.6838235294117647
Test Accuracy with Naive Bayes =  0.7627118644067796
Train Accuracy in SVM 0.90 
Test Accuracy in SVM 0.92 

Task 7: Train a meta-classifier and note the accuracy on test data (10 points)

In [97]:
from mlxtend.classifier import StackingClassifier
from sklearn import model_selection
from mlxtend.classifier import StackingCVClassifier
In [101]:
sclf = StackingCVClassifier(classifiers=[clf_NNH,clf_NB,clf_svm], 
                          meta_classifier=clf_lr,random_state=42)
#sclf = StackingClassifier(classifiers=[clf_NNH,clf_NB,clf_svm], 
 #                         meta_classifier=clf_lr)

print('3-fold cross validation:\n')

for clf, label in zip([clf_NNH,clf_NB,clf_svm, sclf], 
                      ['KNN', 
                       'Naive Bayes',
                       'SVM',
                       'StackingClassifier']):

    scores = model_selection.cross_val_score(clf, x, y, 
                                              cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" 
          % (scores.mean(), scores.std(), label))
    
3-fold cross validation:

Accuracy: 0.79 (+/- 0.04) [KNN]
Accuracy: 0.68 (+/- 0.01) [Naive Bayes]
Accuracy: 0.74 (+/- 0.02) [SVM]
Accuracy: 0.80 (+/- 0.04) [StackingClassifier]

Fitting the stacked classifier on the training split and scoring it on the held-out test split:

In [128]:
sclf.fit(x_train,y_train)
print("Train Accuracy %0.2f " % (sclf.score(x_train, y_train)))
print("Accuracy %0.2f " % (sclf.score(x_test, y_test)))
Train Accuracy 0.90 
Accuracy 0.92 
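The same stacking pattern is available without mlxtend: scikit-learn ships its own StackingClassifier in sklearn.ensemble. A sketch on synthetic stand-in data with the same base learners and meta-learner:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in with the dataset's rough shape.
X, y = make_classification(n_samples=195, n_features=22, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Base learners feed out-of-fold predictions to a logistic-regression meta-learner.
stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier(7)),
                ("nb", GaussianNB()),
                ("svm", SVC(gamma=0.025, C=3))],
    final_estimator=LogisticRegression(),
    cv=5).fit(x_tr, y_tr)
print(round(stack.score(x_te, y_te), 3))
```

Like StackingCVClassifier, it trains the meta-learner on cross-validated predictions, which reduces the leakage a naive stacker would have.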

Task 8: Train at least one standard Ensemble model - Random forest, Bagging, Boosting, etc. - and note the accuracy (10 points)

1. RandomForest classifier

In [106]:
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(n_estimators = 50, random_state=1,max_features=12)
rfcl = rfcl.fit(x_train, y_train)
In [113]:
y_predict = rfcl.predict(x_test)
print("Accuracy of RandomForest model = ",rfcl.score(x_test, y_test))
cm=confusion_matrix(y_test, y_predict,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["0","1"]],
                  columns = [i for i in ["0","1"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
Accuracy of RandomForest model =  0.9152542372881356
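A random forest also exposes per-feature importances, which hint at which vocal measures drive the classification. A sketch on a small synthetic dataset; the column names in `cols` are just an illustrative mapping, not the model's actual ranking on this data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic stand-in; 3 of 5 features carry signal.
X, y = make_classification(n_samples=195, n_features=5, n_informative=3,
                           random_state=1)
cols = ["MDVP:Fo(Hz)", "PPE", "spread1", "HNR", "NHR"]  # illustrative names

rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Impurity-based importances sum to 1 across features.
imp = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
print(imp)
```

On the real model, `pd.Series(rfcl.feature_importances_, index=x_train.columns)` gives the equivalent ranking over all 22 voice features.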

2. Bagging

In [132]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

dTree = DecisionTreeClassifier(random_state=1)   # base estimator (not defined earlier in the notebook)
bgcl = BaggingClassifier(base_estimator=dTree, n_estimators=50, random_state=1)
bgcl = bgcl.fit(x_train, y_train)
In [133]:
from sklearn.metrics import confusion_matrix

y_predict = bgcl.predict(x_test)

print("Bagging classifier accuracy = ",bgcl.score(x_test , y_test))

cm=confusion_matrix(y_test, y_predict,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["0","1"]],
                  columns = [i for i in ["0","1"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
Bagging classifier accuracy =  0.8813559322033898

3. Boosting

AdaBoost

In [134]:
from sklearn.ensemble import AdaBoostClassifier
abcl = AdaBoostClassifier(n_estimators=10, random_state=1)
#abcl = AdaBoostClassifier( n_estimators=50,random_state=1)
abcl = abcl.fit(x_train, y_train)
In [135]:
y_predict = abcl.predict(x_test)
print("Accuracy of AdaBoost Classifier = ",abcl.score(x_test , y_test))

cm=confusion_matrix(y_test, y_predict,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["0","1"]],
                  columns = [i for i in ["0","1"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
Accuracy of AdaBoost Classifier =  0.8813559322033898

GradientBoost

In [136]:
from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(n_estimators = 50,random_state=1)
gbcl = gbcl.fit(x_train, y_train)
In [137]:
y_predict = gbcl.predict(x_test)
print("Accuracy of GradientBoost classifier = ",gbcl.score(x_test, y_test))
cm=confusion_matrix(y_test, y_predict,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["0","1"]],
                  columns = [i for i in ["0","1"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
Accuracy of GradientBoost classifier =  0.9152542372881356
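For a medical screen, recall on the PD class matters more than raw accuracy: a missed PD case is costlier than a false alarm. Per-class precision/recall/F1 make that visible; a sketch on synthetic stand-in data with the dataset's rough class imbalance:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~75% positive (PD) class, like the real data.
X, y = make_classification(n_samples=195, n_features=22, weights=[0.25],
                           random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

gb = GradientBoostingClassifier(n_estimators=50, random_state=1).fit(x_tr, y_tr)

# Per-class precision, recall, and F1 on the held-out split.
print(classification_report(y_te, gb.predict(x_te),
                            target_names=["healthy", "PD"]))
```

Running `classification_report(y_test, y_predict)` after any of the fits above would give the same breakdown for the real models.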

Task 9: Compare all the models (minimum 5) and pick the best one among them (10 points)

In [141]:
print("Train Accuracy of meta-classifier %0.2f   " % (sclf.score(x_train, y_train)))
print("Test Accuracy of meta-classifier %0.2f   " % (sclf.score(x_test, y_test)))
print("Train Accuracy of RandomForest model = ",rfcl.score(x_train, y_train))
print("Test Accuracy of RandomForest model = ",rfcl.score(x_test, y_test))
print("Train Accuracy of Bagging classifier = ",bgcl.score(x_train, y_train))
print("Test accuracy of Bagging classifier = ",bgcl.score(x_test , y_test))
print("Train Accuracy of AdaBoost Classifier =",abcl.score(x_train , y_train))
print("Test Accuracy of AdaBoost Classifier =",abcl.score(x_test , y_test))
print("Train Accuracy of GradientBoost classifier =",gbcl.score(x_train, y_train))
print("Test Accuracy of GradientBoost classifier =",gbcl.score(x_test, y_test))
Train Accuracy of meta-classifier 0.90   
Test Accuracy of meta-classifier 0.92   
Train Accuracy of RandomForest model =  1.0
Test Accuracy of RandomForest model =  0.9152542372881356
Train Accuracy of Bagging classifier =  1.0
Test accuracy of Bagging classifier =  0.8813559322033898
Train Accuracy of AdaBoost Classifier = 0.9926470588235294
Test Accuracy of AdaBoost Classifier = 0.8813559322033898
Train Accuracy of GradientBoost classifier = 1.0
Test Accuracy of GradientBoost classifier = 0.9152542372881356

Among all the classifiers trained - Logistic Regression, Naive Bayes, SVM, k-NN - and the standard ensemble models - Random Forest, Bagging, AdaBoost, and GradientBoost:

The stacking meta-classifier built on Logistic Regression, Naive Bayes, SVM, and k-NN performs best, with 90% train and 92% test accuracy.

The RandomForest and GradientBoost classifiers also perform well, each with 91% test accuracy; their 100% train accuracy, however, suggests some overfitting.